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ABSTRACT 

We propose a domain-independent framework for searching 
and retrieving facts and relationships within natural lan- 
guage text sources. In this framework, an extraction task 
over a text collection is expressed as a query that combines 
text fragments with wild cards, and the query result is a set 
of facts in the form of unary, binary and general n-ary tuples. 
A significance of our querying mechanism is that, despite be- 
ing both simple and declarative, it can be applied to a wide 
range of extraction tasks. A problem in querying natural 
language text though is that a user-specified query may not 
retrieve enough exact matches. Unlike term queries which 
can be relaxed by removing some of the terms (as is done 
in search engines), removing terms from a wild card query 
without ruining its meaning is more challenging. Also, any 
query expansion has the potential to introduce false pos- 
itives. In this paper, we address the problem of query ex- 
pansion, and also analyze a few ranking alternatives to score 
the results and to remove false positives. We conduct exper- 
iments and report an evaluation of the effectiveness of our 
querying and scoring functions. 

1. INTRODUCTION 

The World Wide Web contains a vast amount of informa- 
tion and is a rich source for data extraction, but manually 
extracting data from the Web is a tedious and time consum- 
ing process, especially when a large amount of data matches 
the extraction criteria. Example extraction tasks include 
compiling a list of Canadian writers, finding a list of med- 
ications for a disease, etc. Unless such lists have already 
been compiled and made available on the Web, one has to 
query a search engine, examine the pages returned, and ex- 
tract a handful of instances from each page (if there is any 
at all). The problem is further complicated by the fiexibil- 
ity of natural languages. Consider the example of extracting 
Canadian writers; many bona fide writers are not referred 
to as writers. Instead, they are often coined as authors, nov- 
elists, journalists, etc. If only the phrase "Canadian writers" 
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is used in the query, many qualified instances will not be ex- 
tracted, thus the extraction quality is compromised. Many 
previous data extraction systems either focus on a more spe- 
cific task by imposing tight restrictions on the type of data 
that can be extracted (e.g. finding course offerings and job 
postings) or are only applicable to documents that follow a 
specific formatting (e.g. wrappers). For example, the Know- 
ItAU ^13j system can extract hyponyms of a user-specified 
class. The online prototype of the system is further extended 
to support a few more specific binary relations (e.g. X "ceo 
of" Y). A challenge facing many data extraction systems in 
general is that the extraction task often is not easy to define 
or a definition may not be accurate, leading to low precision 
and recall. Limiting the extraction task to a few predefined 
classes is one way to reduce the complexity of the problem. 

In this paper, we address the problem by introducing a 
framework that allows an extraction task to be encoded as 
a simple query. A query in this framework is a natural lan- 
guage sentence or phrascQ with some wild cards, and the 
result of a query is a ranked list of matching tuples. For 
instance, given the query "% is a car manufacturer", the 
output is expected to be a ranked list of car manufacturers, 
preferably the real car manufacturers ranked the highest. 
This query only uses one wild card, here denoted with %. 
In general, a query can use more than one % wild card, and 
the result of the query in this case is a table with one column 
for each occurrence of the wild card. 

Our first contribution is a declarative querying framework 
for data extraction. Integration of wild cards in our queries 
can generally reduce the number of queries that must be 
issued, hence simplifying the extraction task. In our ear- 
lier example about Canadian writers, for instance, a user 
can use one type of wild card to indicate that terms sim- 
ilar to writers should also be considered. Another type of 
wild card may be used to indicate a probable position of 
the desired data, from which values can be extracted. Com- 
bining such wild cards with natural language phrases can 
provide a simple but powerful interface, which can handle 
much more extraction tasks than previous systems. There 
is a close correspondence between our queries and star-free 
regular expressions; our queries make use of certain abstrac- 
tions geared toward natural languages which make it simpler 
to write queries. Since the result of a query is a relation, our 
queries with some syntactic sugar can be integrated in the 
from clause of SQL queries. 

Our second contribution is the idea of query expansion 



We assume query phrases are in English, but our framework 
should be applicable to other languages as well. 



through a set of declarative rewriting rules between para- 
phrases. Since a given user query may not retrieve an ade- 
quate number of facts, rewriting rules are generally expected 
to improve the coverage of the queries and the quality of the 
results. Our experiments, as reported in Section [S] show 
that increasing the number of rewritings can improve both 
recall and precision. 

As our third contribution, we address the problem of rank- 
ing in the context of rewriting rules. We propose a few al- 
gorithms for ranking the extracted results where each algo- 
rithm exploits some of the relationships that exist between 
the set of matching tuples and also between the set of query 
patterns. We use the general term pattern to refer to both 
query and query rewriting. When the results are ranked, 
it is possible to set a cutoff threshold to filter invalid rows 
from the result, making the final results more accessible to 
the user. 

Finally, as our last contribution, we theoretically analyze 
our algorithms and experimentally evaluate their behaviour 
in the setting of the Web. Our experimental results include 
comparisons with some alternative approaches to data ex- 
traction. 

The rest of the paper is organized as follows. Section [2] 
describes both the syntax and the semantics of wild cards, 
as well as the queries in our framework. An overview of 
our query evaluation in the context of the Web is given in 
Section [31 Section |3] discusses the details of our rewriting 
rules and patterns. Our ranking algorithms are discussed in 
Section O Experimental results are presented in Section [S] 
and the related work is reviewed in Section [7] We end the 
paper with conclusions and future work in Section [S] 

2. WILD CARD QUERIES 

The use of wild cards is prevalent in many areas of com- 
puter science, with examples found in SQL, operating sys- 
tem shells and scripting languages such as Perl, Awk and 
Python. Unlike many of these systems, our introduced wild 
cards iterate over the domains of parts of speech or other 
meaningful natural language word groupings. In particular, 
we introduce two types of wild cards, namely * and %. 

% wild card: The % wild card represents one or more 
noun phrases. A noun phrase may consist of one or mul- 
tiple words; for instance, "movie" and "action movie" are 
both noun phrases. This wild card, when used in a query, 
indicates the location of a noun phrase or noun phrases 
that should be extracted. For example, the query "summer 
movies such as %" will extract noun phrases Harry Potter, 
Shrek, and Spiderman from the following sentence: "Popular 
summer movies such as Harry Potter, Shrek and Spiderman 
appeal to audience of all ages." 

* wild card: The * wild card represents a set of phrases 
with the same or similar meanings to a given phrase. Con- 
sider the task of finding a listing of summer movies; we may 
type the query "% is a summer movie". However, some 
bona fide movies are often referred to as "films", "block- 
busters", and so on. In a naive approach, one may have to 
try other terms similar to "movie" manually, save the results 
each time, and put the results together at the end. The 
naive method is tedious and inefficient. In our queries, a 
term may be enclosed within a pair of * to instruct that the 
search should be extended to include terms and phrases sim- 
ilar to the given one. For example, the user can re-formulate 



the query as "% is a summer *movie*", and the query will 
be automatically expanded to include related queries such 
as "% is a summer film", "% is a summer blockbuster", etc. 

It is feasible to consider other wild cards. For instance, we 
could have wild cards that only match verbs, adjectives, or 
a union of nouns, verbs and adjectives. It is also possible to 
have a wild card that matches a prefix or suffix of a term or 
a fixed-length sequence of terms. In an attempt to keep the 
syntax of our queries simple, this paper only considers the 
two wild cards % and *, as discussed above. The following 
is a list of example queries: 

• % is a ^country* 

• % is a summer ^blockbuster* 

• % invented the light bulb 

• Google ^acquired* % 

A query may use any number of wild cards. Given a query 
with k % wild cards, the result of the query is a table with k 
columns, one for each % wild card. It is natural to assume 
that the result of a query is ranked and each row is assigned a 
score as an indication of the level of support the row receives. 
This score may depend on the size and the coverage of the 
text collection that is being queried and the set of query 
rewritings that are being used. Section [5] discusses a few 
measures to rank the matching tuples of a query. 

A query can have any number of * wild cards. Given a 
query q with some * wild cards, let qi, . . ., qs be the set of 
queries that are obtained by replacing each * wild card with 
similar terms. A row matches q if it matches at least one of 
qi, . . ., qs; the score of the row for q may be an aggregation 
of the scores of the row for gi, . . ., qs. For our purpose, two 
terms are considered similar if they have the same meanings 
(e.g. synonyms), one is a generalization of the other, or the 
two terms can be used interchangeably in the same context. 
The similar terms can be often obtained from dictionaries, 
thesaurus, online corpus [18], etc. 

3. EVALUATING WILD CARD QUERIES - 
AN OVERVIEW 

This section provides an overview of evaluating wild card 
queries over a text corpus. Without loss of generality, our 
discussion in this section is centered around the Web and 
uses the query "% is a *blockbuster*" as an example. In par- 
ticular, we consider the scenario where the extraction engine 
is built on top of a search engine. Naturally the same steps 
can be taken when the source data is stored locally, with a 
difference that a local collection may be better indexed and 
the queries can be better optimized. To limit the scope of 
the paper, we do not address the issues related to indexing 
and query optimization. 

Step 1 - query flattening As the first step, the query is 
analyzed and expanded by replacing the words enclosed by 
pairs of * wild cards (if any) with their similar terms. In 
the given query, the word "blockbuster" is enclosed by *'s; 
similar words to blockbuster, based on an online system [18] . 
are "movie" and "film". Two new queries, "% is a movie" 
and "% is a film" are formed and added to the expanded 
query set. In general, more queries may be added if multiple 
synonyms are found. 



step 2 - query rewriting In the next step, each query in 
the query set is passed to a Part-Of-Speech (POS) tagger. 
Each tagged query is compared with a set of precompiled 
patterns for possible rewritings. The result of query tag- 
ging is not always reliable, in particular for short queries. 
To account for those cases, queries are also rewritten using 
rules that do not require tagging. Let's consider the query 
"% is a movie" first; after tagging, the query conforms to 
the pattern "NPl is a(n) NP2" where NP stands for a noun 
phrase; note that the wild card % matches a noun phrase, 
as we defined earlier. A pattern may be found relevant to a 
pre-determined class, based on a specific relationship it de- 
scribes, and may be rewritten by other patterns in the same 
class. The pattern "NPl is a NP2", in particular, belongs to 
the hyponym class, since the template indicates that NPl is a 
(hyponym of) NP2. Other patterns in the hyponym class in- 
clude "NP2 such as NPlList", "NP2 including NPlList", etc. 
All patterns in the matching class (i.e. the hyponym class) 
are instantiated according to the matched query. Thus, the 
query set is expanded with extra queries like "movies such 
as %" and "movies including %". Section |4] discusses our 
query rewritings in more detail. Similarly, the query "% is a 
blockbuster" also matches a pattern in the hyponym class, 
and the query set is further expanded. If the query cannot 
match any pattern, no query expansion will occur at this 
step. 

Step 3 - information retrieval engine As the third step, 
all queries in the query set are sent to a search engine. For 
each query, the matching snippets are downloaded for fur- 
ther processing. When there is a large number of matches, 
only a fixed number of them are selected. HTML tags are 
stripped from downloaded snippets for each query and the 
remaining text is analyzed to identify the pieces that match 
the query. Noun phrases that appear in the positions of 
% wild cards of a query are extracted from the text and 
are saved in the result set. Words other than noun phrases 
should not be extracted even if they appear in target loca- 
tions. Suppose the query "% invented the light bulb" is sent 
to a search engine and the following two snippets are among 
those returned. 

• Thomas Edison is often said to have invented the light 
bulb. 

• We all learned in our history classes that Thomas Edi- 
son invented the light bulb in 1879. 

The POS tagger identifies that the word "have" in the first 
snippet is not a noun phrase, while "Thomas Edison" from 
the second snippet is. Therefore, the phrase "Thomas Edi- 
son" is extracted but "have" is not. 

Step 4 - relevance ranking The result of extraction in the 
previous step is a set of rows; for the given example, each 
row is a noun phrase. A ranking algorithm is applied to 
the extracted set. Section [5] gives the details of our ranking 
algorithm. Finally, a sorted list of rows is returned. The rest 
of this paper will focus on query expansion and relevance 
ranking. 

4. REWRITING QUERIES 

A challenge in querying and information retrieval from 
natural language text is the possibility of a mismatch be- 
tween the expressions of queries and texts that have the 



relevant information. A fact can potentially be expressed in 
many different contexts and a query that gives one context 
can miss many qualified candidates. We propose rewrit- 
ing rules to express the set of transformations that can be 
applied to a wild card query leading to alternative query 
expressions that are expected to return the same or seman- 
tically related results but are syntactically different. Query 
rewriting is expected to increase recall, as it can be seen 
from our example in Section |3] Our experiments in Sec- 
tion [S] confirm that there is also a correlation between the 
number of rewritings and the precision of the retrieved re- 
sults. There are more evidence in favor of rewriting natural 
language questions. For example, at TREC-10 QA evalua- 
tion, the winning system used an extensive set of rewritings 
as its only resource |22) . 

4.1 Rewriting Rule Language 

The rewriting rule language lists different ways of rewrit- 
ing a query. Each rule here is of the form rule-head —> rule- 
body. A rule head consists of one or more regular expressions, 
and a rule body consists of one or more rewritings with place 
holders. Multiple regular expressions in the head or rewrit- 
ings in the body can be listed each in a new line. A rule 
matches a query if any one of the regular expressions in the 
head matches the query. When a rule matches a query, the 
query is expanded with all rewritings in the rule body. For 
each match, keywords from the query may be remembered 
using capturing groups (e.g. parentheses) and the remem- 
bered values may be recalled using back references in the 
rule body. The remembered values can be transformed (e.g. 
from a singular noun to a plural noun) before being used 
in the rewritings. Each transformation is usually done by 
looking up a table, from a set of tables compiled in advance 
from a dictionary, thesaurus, and other ofiline sources. This 
allows us to write generic rewritings that can match a large 
number of queries. Our rewriting rule language is extend- 
able and one can add more rules as they become available. 
Here is an example of a rule. Given the query "movies such 
as %", the following rule generates "%, and other movies" 
and "% is a movie" as possible rewritings. 

(.+),? such as (.+) 
(.+),? including (.+) 

$2, and other $1 && plural($l) 
$2 is a $1 && singular ($1) 

Let n denote the average number of rewritings produced 
for each query. If a query uses k star wild cards and each of 
these wild cards is replaced with m > similar terms on av- 
erage, the query expansion would produce m'°(n + l) queries. 
To appreciate the power of the rewriting rule language, con- 
sider an alternative scenario where one has to enumerate 
and try many of those queries manually. 

4.2 Compiling Rewriting Rules 

Gathering a comprehensive set of rewritings for a large set 
of possible queries in advance can be challenging, especially 
if not much is known about the queries in advance. However, 
many short queries fall into some sort of common templates 
for which generic paraphrasing rules usually apply. In En- 
glish, two classes of generic rules that we can easily identify 
are hyponyms and morphological variants. A hyponym pat- 



tern describes lexico-syntactic relations that can be used to 
infer one element is a hyponym of another within a sentence. 
Hearst gives a list of hyponym patterns [TS]. A sample of 
hyponym patterns (either from Hearst's or hand-crafted by 
us) can be found in Table [T] 



NPl {,} "such as" NP2List 
"such" NPl "as" NP2List 
NPl {,} "especially" NP2List 
NPl {,} "including" NP2List 
NP2List "and other" NPl 
NP2List "or other" NPl 
NP2, "a(n)" NPl 
NP2 "is a(n)" NPl 
NPl NP2 

Table 1: Hyponym patterns 

The morphological variants of verbs are useful for rewrit- 
ing many queries that contain verbs. A given query may 
be rewritten by simply changing its verb tense and with- 
out much affecting its meaning. Many extraction tasks are 
expressed in the form of "Subject transitive- Verb Object" 
which can be rewritten in a passive form and vice versa. 
For example, if a user wants to find out who invented the 
light bulb, she can express the extraction as "% invented the 
light bulb"; a rewriting of the query is "the light bulb was 
invented by %". Our morphological patterns enumerate dif- 
ferent verb tenses (e.g. present tense, past tense, . . . ) and 
use both active and passive forms. We were able to express 
all the relationships described by patterns in the hyponym 
and morphological classes as rules in our rule language. 

Although generic patterns and rewritings can be applied 
to a wide range of queries, it is not hard to find queries 
where no generic pattern is applicable or sufficient. With 
a rule language, many of these rules can be expressed and 
regularly revised as more queries are seen. There is also an 
active research on automatically generating paraphrases in 
more specific settings [191 121). which may also be encoded 
as rules in our framework. 

5. BRINGING ORDER TO RESULTS 

The data extraction process discussed in this paper can 
accumulate a large set of candidates. Not all candidates are 
correct, meaning that a user would not consider them as 
good matches. 

5.1 Sources of Errors 

A query typed by user can match many non-relevant rows, 
in the sense that they may not be anticipated by the user. 
For instance, the query "% is a country" matches the state- 
ment "Joe is a country singer," thus Joe will be added to the 
list of country names. Through a heavy natural language 
processing (NLP), one may reduce the number of those false 
positives; however, these techniques are generally computa- 
tionally intensive and may not scale up well to large volumes 
of data and queries (e.g. on the Web). In general, broad 
queries are likely to match more false positives. Rewriting 
queries can also broaden the queries and introduce addi- 
tional false positives. Also false positives galore on the Web; 
since the published content may not be verified for correct- 
ness, statements, such as "New York is the capital of the 
United States" are not rare. 



One more source of error is due to using a POS tagger. Al- 
though POS taggers produce good results most of the time, 
the accuracy usually depends on factors such as the num- 
ber of words in the lexicon, etc. Sometimes verbs are mis- 
classified as noun phrases, or vice versa. Since correct ex- 
tractions are inter-mixed with errors, it is important to rank 
all candidates in terms of their relevance to the user query. 

5.2 Ranking Heuristics 

Given a query and a set of rewritings, we want to find 
out meaningful ways of ordering the results. In this section, 
we present a few heuristics before we analyze them within a 
more general ranking scheme in the next section. 

Number of Matched Pages or Documents (NPages): 

The relevant matches of a query are likely to appear fre- 
quently within the query pattern or one of its rewritings. 
One heuristic is to rank a tuple based on the number of 
pages or documents in which it matches the query or one of 
its rewritings. 

Mutual Information (MI) Another ranking scheme which 
has been used previously to quantify the relationship be- 
tween two random variables is the Mutual Information (MI). 
If we denote the probability that a document contains the 
text of query q (ignoring wild cards) with P{q), the prob- 
ability that a document contains a candidate r with P{r), 
and the probability that a document contains a proper en- 
coding of r in g with P{q,r), then the mutual information 
between q and r is defined as 



M/(g,r) =, 



P{q, 



P{q)P{r)' 



In some formulations of the mutual information, the above 
formula is multiplied by P{q,r) [9]. This measure is used 
in the past, for instance to evaluate the association between 
words (TTj, and also between the instances of a class and a 
discriminative phrase [13] • In our case, since P(g) is fixed for 
a given query, the score of a candidate can be estimated as 
the ratio P{q, r) / P{r). For a given query and candidate, the 
MI measures the conditional probability that the candidate 
appears within the query template given that the candidate 
appears in a document. 

Number of Matched Patterns (NPatterns) Another 
simple ranking is to count for each candidate, the number 
of different patterns (including the query and its rewritings) 
that would extract the candidate. Because of the semantic 
relationship between a query and its rewritings, if a candi- 
date is retrieved by multiple patterns, then there is probably 
a good indication that it is indeed a good match. 

Discussion of Ranking Heuristics Our experiments com- 
paring these heuristics (as reported in Sec. |6]) show mixed 
results for different queries. A general drawback for NPages 
and NPatterns is that all query patterns are treated equally 
important. With NPages, some correct instances that are 
not frequent cannot be well-separated from false positives. 
Also under NPages, the scores would be inflated when there 
are duplicates (such as those on the Web). A drawback for 
MI is that a candidate may not appear with the query but 
it may appear with one of its rewritings. Selecting a single 



^The query would not have been issued in the first place if 
it is assumed otherwise. 



pattern is not guaranteed to achieve a high recall. Also it 
is not clear how MI can be extended to account for mul- 
tiple rewritings of a query and also queries with multiple 
extractors. 

5.3 Relationship Graph between Patterns and 
Tuples 

Let Sp denote the set of query patterns (i.e. the user- 
specified query and all of its rewritings) and St denote the 
set of matching candidate tuples. Consider the bipartite 
graph G formed between Sp and St with an edge from 
p £ Sp to t £ St if i matches p in some text. In some 
settings, the edges of the graph may be assigned weights to 
indicate the degree of a match. We define a ranking F as 
a function that maps St to an n-dimensional vector where 
n is the size of St- Some of our earlier heuristics can be 
seen as special cases of this ranking function. In particular, 
N Patternsit) — indegree{t) and 

NPages{t) = ^ w{p t) 

peSp 

where w(p t) is the number of pages that give rise to a 
match between t and p. 

A limitation of NPatterns and NPages is that all patterns 
have the same infiuence on the scores. Our observations, 
however, indicate that often there are a few patterns that 
retrieve many good quality rows while the rest retrieve many 
false positives. To remedy the problem, we propose weight- 
ing the patterns and propagating the weights to the match- 
ing candidates. The weights may be assigned at the same 
time rewritings are compiled if they are available and are not 
expected to change much. For generic rewritings, however, 
the weights can change from one query to next and a more 
dynamic weighting scheme may be preferred. 

A hypothesis is that good tuples and good patterns exhibit 
a mutually reinforcing relationship: a good tuple is extracted 
by rnany good patterns; a good pattern extracts many good 
tuples Q For example, if Canada is indeed a good match 
for the query "% is a country", it should be extracted by 
many good related patterns, such as "countries including 
%", "such countries as %", and so on. Similarly, if "countries 
such as %" is indeed a good pattern for extracting country 
names, it should extract many good instances, like the U.S., 
Canada, China, etc. Our experiments use an adaptation of 
Kleineberg's HITS algorithm TfT, but there are indications 
that other two-level infiuence propagation algorithms can 
equally be used. 

Algorithm PT-hits Let's associate weight WT{t) to each 
candidate tuple t, and weight wp{p) to each pattern p. In an 
iterative and alternating fashion, the weights can be updated 
as follows: 



WT{t) = ^ Wp{p) 

{p\p extracts t} 

Wp{p) = ^ WT{t) 

{t I £ is extracted by p} 



(1) 



(2) 



The weights can initially be all set to 1, then propagated 
from patterns to tuples and vice versa. After each iteration, 

^The same hypothesis is made in a hyperlinked environment 
where several ranking algorithms are proposed for finding 
authoritative Web pages. 



the weights are normalized to keep them within a bound and 
also to check the convergence. It is easy to show that this 
iteration converges. 

5.4 Analyses of the Rankings 

Some of the scoring functions presented in Section [531 have 
desirable properties that are useful in the setting of the Web. 
One such property is monotonicity, i.e. for every pair of tu- 
ples ti and t2, if every pattern p that extracts t\ also extracts 
t2 and w{p — > ii) > w{p t2), then WT{ti) > WT{t2). It 
is easy to show that the functions NPages, NPatterns and 
PT-hits are all monotone. 

Two other interesting properties are stability and locality. 
A scoring function is stable if small changes in the graph 
structure between patterns and tuples have no effect on the 
scores of the tuples, and a scoring function is local if re- 
moving an edge has no effect on the weights of non-adjacent 
tuples. With data on the Web regularly changing, it is desir- 
able to know if the scores would change with small changes 
on the Web. More formally, let Qm.n be a collection of bi- 
partite graphs over a set of m patterns and n tuples, F{G) 
denotes the weight vector of tuples after a scoring function 
F is applied to a bipartite graph G, and d() be a function 
measuring the distance between two weight vectors. 

Definition 1. A scoring function F is stable on Qm.n un- 
der distance function d if for every fixed k, we have 



lim 

m or n — >oo Gf^Qri 



max d(F(G),F(G\{ei,...,efc})) = 

,,ei,...efcS-B(G) 



where E{G) denotes the edges of G and G\{ei, . . . , e^} is 
the graph obtained after removing edges ei , . . . , of G. 

Definition 2. A scoring function F is local if for every 
graph G £ Qm.n and every edge e £ E{G), if wi = F{G) 
and W2 = F{G\e), then for all tuples i and j which are not 
adjacent to e, the ordering between i and j under wi and W2 
remains unchanged. 

Theorem 1. The scoring function NPatterns is stable un- 
der both Kendall tau distance and Manhattan distance be- 
tween the weight vectors. 

Theorem 2. NPages is stable under the Kendall tau dis- 
tance. NPages is also stable under the Manhattan distance 
if w{p —* t) bound to a fixed constant. 

The proofs for Theorems [1] and [2] can be found in the ap- 
pendix. 

Theorem 3. The scoring functions NPatterns and NPages 
are both local. 

Proof. The locality follows from the fact that removing 
an edge e only affects the score of a tuple that is linked by 
e. □ 

Based on the results reported for HITS [5], PT-hits is neither 
stable nor local. 

6. EXPERIMENTS 

To evaluate our querying framework and ranking algo- 
rithms, we built a search tool, called DeWild, which relies 
on the Web as its source of dataQ. Using the Web compared 

'^DeWild stands for Data Extraction using Wild cards. The 
system is available online at dewild. cs .ualberta. ca 



to a closed collection has both benefits and drawbacks. A 
benefit is that its information redundancy can compensate 
for the relatively small size and coverage of our rewriting 
rule set and the lightweight NLP techniques used. A draw- 
back is that the text collection often is not clean and there 
are many bogus tuples that need to be filtered. 

DeWild takes advantage of existing commercial search en- 
gines and currently queries Google via its search API. In our 
experiments, 200 snippets are downloaded for each extrac- 
tion pattern (our online system only downloads 30 snippets 
per pattern and also lists the set of extraction patterns that 
are tried). If there are fewer than 200 snippets found for a 
pattern, then all available snippets are downloadecfl. The 
snippets returned by a search engine typically consists of 
the search query and its surrounding text. Since the tar- 
get data appears immediately before or after the user query, 
they can be often extracted using the snippets only (with- 
out downloading the actual pages), hence network and pro- 
cessing costs are significantly reduced. Though it should 
be noted that a snippet may lack sentence boundaries, and 
this can reduce POS tagging accuracy. For our experimen- 
tal comparisons, we implement all heuristics discussed in 
Section [5] However, if it is not explicitly stated otherwise, 
DeWild uses PT-hits as its native ranking. 

A publicly available POS tagger called NLProcessoiQ is 
used to identify the part of speech from retrieved text, so 
that only noun phrases are extracted for % wild cards. For 
* wilds cards, our system uses a collection of related words 
automatically compiled [T§] from Wall Street Journal cor- 
pus, but it can equally use other collections as well. Next 
we report our experiments with DeWild. 

6.1 Effectiveness of the Queries 

To measure the effectiveness of the proposed querying 
framework in expressing different extraction tasks, we used 
the tool to express and answer natural language questions 
seeking short answers. We randomly selected 100 natural 
language questions from the AOL query log [20], with the 
selection criteria that the queries were all started with one 
of the wh-words what, which, who and where and were fol- 
lowed by is, was, was or were. The break-down of the queries 
were as follows: what-84, who-6, where-7 and which-3. The 
why queries were excluded since these queries often seek long 
answers and cannot be expressed in our framework. Even 
among the selected queries, not all queries were expected to 
have short noun phrases as answers (e.g. What is a hedge 
fund? Or, what are the causes of poverty?); in fact, 58% of 
the queries were in this category, and they could not be prop- 
erly expressed as queries. Another 7% of the queries were 
ill-formed search queries in the form of questions which we 
didn't expect them to have answers (e.g. what is my ac- 
count balance?). We removed both classes of queries before 
further considerations. We further removed another 2% of 
the queries where we couldn't find an answer after an exten- 
sive search on the Web, and we were not sure if the questions 
had an answer at all. The remaining 33% of the queries were 
all expressed as queries in our framwork and were evaluated 
using our online tool. 

As a baseline for comparison, we also passed the queries as 



^Our online demo downloads at most 30 snippets for each 
query and each rewriting to keep the response time short. 
WWW. inf ogistics . com/textanalysis .html 



Question length 





< 6 terms 


> 6 terms 


all 


DeWild 


0.88 


0.41 


0.68 


OpenEphyra 


0.41 


0.22 


0.32 



Table 2: Precision of the results for natural language 
questions 

they were expressed to OpenEphyra Q a fuU-fiedged open- 
source question answering system. Similar to ours, OpenE- 
phyra uses the Web, and in particular the Google search 
engine, to evaluate the questions, hence both DeWild and 
OpenEphyra had access to the same data collection. How- 
ever, OpenEphyra does much more extensive parsing of the 
questions and the answers, and has components for detect- 
ing answer types (of questions) and named entity classes in 
text. 

Top 5 results of each query, as returned by each compar- 
ison system, were given to two annotators who were asked 
to check of the set included a correct answer to the original 
question. Table[2]shows the precision, averaged over two an- 
notations per question, for both systems. DeWild achieves 
a much better precision, with a good annotator agreement 
(kappa) of 0.66, for both short and a bit longer questions, 
retrieving a correct answer for twice as many queries as 
OpenEphyra. 

6.2 Recall and Precision 

In general, it is difficult to measure recall on the Web since 
we often do not know the full answer set. The answer set 
may not be all on the Web, or it can be scattered in many 
pages of which some may not be crawled or indexed by a 
search engine. To measure both recall and precision under 
these constraints, we decided to extract instances of some 
known classes. 



Pattern 


Weight 


US states, including % 





2514 


US states such as % 





1826 


% and other US states 





1766 


% is a US state 





0992 


such US states as % 





0962 


US states, especially % 





0789 


% or other US states 





0658 


% is the US state 





0440 


US states % 





0052 


%, the US state 







US state % 







%, a US state 








Table 3: Matching patterns for the target US state 
names, with the weights computed by PT-hits 

In one experiment, we used the names of 50 US states as 
the ground truth and tried to retrieve and rank the same 
data using our heuristics. The query was formulated as 
"US states such as %"; Table [3] shows the extraction pat- 
terns which matched our initial query, after instantiating 
"US states" in our generic patterns, as well as their weights 
computed by PT-hits. Clearly, any of the patterns in the ta- 
ble could have been used as a query and the result returned 

^http://www.ephyra.info 



by DeWild would have been the same. In our evaluation, 
a retrieved candidate was treated "correct" if it was either 
a full state name or an abbreviation. Figures [TJa) and (b) 
show precision and recall for our heuristic (PT-hits, NPages, 
NPatterns) as well as MI and Google for comparison. For 
Google, we tried the queries both with star in place of the 
wild card and without, since Google treats star as a spe- 
cial character and highlights the matching terms. To do 
a ranking using mutual information (MI), we could use ei- 
ther the query or one of its rewritings. Since it is not clear 
which rewriting performs the best, we ran the algorithm 
three times with the discriminative phrases "US state of X", 
"US states such as X", and "X is a US state". For clarity of 
the figure, we only plotted the one performing the best for 
all recall values. KnowItAU [TH] was initially using MI for its 
ranking, but it's configuration was changed over time, hence 
we ran our search on the online systemlj as well. The results 
for KnowItAU must be treated with a grain of salt though, 
since it was the only system using its own local collection. 

Many of our comparison systems show a precision of 1 
(or very close to 1) for small recalls, meaning that they all 
retrieve a couple of correct state names on top. The precision 
generally drops as more state names are discovered. PT-hits 
performs the best for this particular query, retrieving more 
than 80% of the state names without a single error. 

Figures [1] (c) and (d) show the comparison between the 
same systems with the extraction target set to the coun- 
try names, compiled by the US State DepartmenlQ, as the 
ground truth. The query "countries such as %" was used 
for the extraction in DeWild. Many of the systems except 
MI perform very well, with PT-hits again showing the best 
performance. This could be partly because of the high qual- 
ity of search engine results for this particular query. MI 
performs the least favorable for low recalls, and this could 
be due to the inaccuracy of the search engine results to our 
count queries. 

6.3 Number of Rewritings 

Adding each query rewriting introduces some cost at the 
query processing time, and a question is if this additional 
cost is justified. To evaluate the efi'ect of the number of 
rewritings on the precision, we used PT-hits to compile a 
list of "US states" but varied the number of rewritings that 
were used. We chose random sets of 2, 3, and 4 patterns 
from Table |3] and ran PT-hits each time with only one of 
these sets. Each experiment was repeated three times, each 
with a different random pattern selection. Figure [2] shows 
the average precision at each position. The average precision 
improves when the number of patterns increases from 2 to 
3. The precision further improves significantly when the 
number of patterns is increased from 3 to 4. Recall can 
either stay the same or increase using more rewritings but 
can never deteriorate. 

6.4 TREC Questions 

Finding candidate answers for natural language questions 
in a closed corpus can be challenging, especially if answers 
are phrased differently than questions, and this is usually 
the case in the question answering (QA) track of TREC. A 
solution adopted by some QA systems (e.g. AskMSR [6]) 
is to find question answers in a larger collection such as the 

^www. cs .Washington. edu/research/knowitall 

"www. state ■gov/www/regions/independen t_states -html I 



US states 




Figure 2: Average precision at each position for PT- 
hits, varying the number of patterns for target US 
states 



Web and project the answers to a TREC corpus to verify 
their correctness. If a question is formulated as a DeWild 
query, we can use our approach to locate the answer from 
the Web. 

For our evaluation, we took the first five QA targets from 
the TREC 2004 dataset [23; since a QA target consisted of 
multiple questions, we ended up with a total of 22 questions 
in the experiment. For each question, we report the num- 
ber of correct answers given by TREC, the rank of the first 
TREC answer under DeWild, and the number of overlaps 
between TREC and DeWild. The result of the evaluation is 
presented in Tabled 

For 64% of the questions (i.e. 14 questions), all answers 
returned by TREC were also returned by DeWild in posi- 
tions 1 to 3. For 18% of the questions, either we didn't have 
an obvious formulation of the questions or our formulation 
did not retrieve any matching candidate to TREC. These 
questions are marked with "na" in the table. We found out 
that the TREC answers for question 1.3 were not the ground 
truth on the Web; therefore there was a small overlap be- 
tween TREC and DeWild. For question 5.4, which asked 
for the CEO of AARP, TREC had "Horace Deets" or "Tess 
Canja" as the correct answer; this was based on the informa- 
tion in year 2004. At the time of running our experiments, 
the correct answer was "Bill Novelli", and DeWild extracted 
the more up-to-date correct answer. 

DeWild sometimes returned additional instances of which 
some were correct and others were incorrect but appeared 
with the query and gave additional information. For in- 
stance, consider the question "Who discovered prions?" from 
TREC which has only one correct answer. We transformed 
the question to "prions are discovered by %" and passed it 
as a query to DeWild. The highest ranked instance, "Stan- 
ley Prusiner", was the correct answer to the question, and it 
also received a substantially larger weight than the second 
best instance. Our system returns other acceptable answers, 
including the 8th-ranked "Dr. Stanley Prusiner", the 9th- 
ranked "researcher Stanley Prusiner", and the 12-th ranked 
"Nobel Prize winner Stanley Prusiner". These other answers 
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Figure 1: Precision and recall with the extraction target set to the US states in (a) and (b) and country 
names in (c) and (d) 



show that Stanley Prusiner was a doctor, a researcher, a 
Nobel prize winner and he was from the University of San 
Francisco. 

6.5 Ad Hoc Data Extractions 

As our last experiment, we tried to compile useful resource 
lists which we could not find in a list format anywhere on 
the Web. In one case, we tried to find the names of sum- 
mer movies. Although some online resources maintain a 
quite complete list of movies, they don't classify movies as 
summer movies or otherwise. The pattern "% is a summer 
*blockbuster*" is used as the query for the task. The term 
blockbuster, which is enclosed by * wild cards in the query, is 
augmented by two extra related terms: movie and film. We 
manually evaluated the extracted results using the Internet 
Movie Database (IMDB) and concluded that all the results 
in the top 10 were indeed correct movie names, and their 



release dates were in the summer. 

In one more experiment, we used the query "% is a Cana- 
dian writer" to compile a list of Canadian writers. This time, 
we put together a set of rewritings that were specific to the 
query. The query returned over 1300 names. We could ver- 
ify that 91 of the first 100 rows were real Canadian writers. 
Of the first 200 rows that we verified, 156 were real Cana- 
dian writers. We also compared the first 200 tuples to two 
of the most comprehensive online lists of Canadian writers 
that we could find. DeWild retrieved 86 real author names 
which could not be found in one listPI and 70 names which 
were not in the other listF^. After combining the two lists, 
DeWild still reported 58 names which we couldn't find in 
the combined list. This experiment shows that our queries 
can be used to compile a reasonably good list of resources 

'^''www. trackO . com/ ogwc/authorSj 

^^ www.um anitoba . ca/ canlit/ authorlisti 



Question 


TREC 


DeWild results 




ans. 


rank(lst overlap) 


# of overlaps 


1.1 


1 


2 


1 


1.2 


1 


2 


1 


1.3 


14 


3 


5 


1.4 


1 


na 


na 


1.5 


1 


1 


1 


2.1 


1 


1 


1 


2.2 


1 


3 


1 


2.3 


5 


1 


3 


2.4 


1 


3 


1 


3.1 


1 


2 


1 


3.2 


1 


na 


na 


3.3 


1 


na 


na 


4.1 


1 


1 


1 


4.2 


1 


1 


1 


4.3 


1 


1 


1 


4.4 


4 


2 


3 


4.5 


1 


1 


1 


5.1 


1 


1 


1 


5.2 


1 


1 


1 


5.3 


1 


2 


1 


5.4 


1 


na 





5.5 


6 


9 


2 



Table 4: DeWild's handling of the first 22 questions 
from TREC 2004 



which can be further edited for correctness. 

7. RELATED WORK 

There is a large body of work on question answering. 
Many systems use a combination of NLP techniques (deep 
or shallow), learning algorithms and hand-crafted rules to 
classify the questions and to establish relationships between 
terms of a question and a possible answer sentence (e.g. [171 
112)). Despite some overlap, there are fundamental differ- 
ences between our work and the work on question answer- 
ing. Our work is not a replacement for QA systems that 
can often handle complex questions; it is rather a natural 
way of expressing short questions or extraction targets. It 
is possible to integrate our work within a question answer- 
ing system if natural language questions can be mapped to 
DeWild queries. 

Large-scale data extraction from the Web has been the 
subject of various recent work [1]. In particular, Brin [7] 
suggests an algorithm which takes a small number of ex- 
amples of a class as a seed set and extracts more examples 
of the same class. His algorithm learns a set of extraction 
patterns for each page (or pages with the same URL prefix) 
that contains some of the examples and use those patterns to 
extract more tuples from the same page(s). This algorithm 
does a good job when data is structured in a tabular format 
but is not expected to work on free text. This is because it 
is generally unlikely to find more than one example (of the 
seed set) in a text document such that their surrounding 
texts are the same. 

KnowItAll [13] takes the description of a concept or class 
(e.g. cities) as input and extracts instances (e.g. Paris, 
New York, . . . ) of the class. The system maintains a set of 
rules which can be instantiated with an input class to pro- 
duce keywords that must be used to extract the instances 



of the class. KnowItAll uses co-occurrence statistics, specif- 
ically mutual information, to assess the relatedness of each 
instance. Our approach differs from KnowItAll in several 
important aspects; First, the query-based interface and the 
support of wild cards make DeWild more adaptive to dif- 
ferent extraction tasks. Second, unlike KnowItAll where a 
concise description of a class must be given, a DeWild query 
may specify only the context in which the instances may 
appear, and this is useful when a concise class description is 
not available (this problem is somewhat addressed in Tex- 
tRunner [3])- Last but not least, our approach of porting 
link-based ranking algorithms for assessing extraction re- 
sults from text is novel and performs better than the one 
used in KnowItAll. 

Related to our query rewritings is the work on query ex- 
pansion [4] and query transformation for question answering 
l^. Query expansion has shown to be difficult for phrase 
queries. Also query expansion and transformation tech- 
niques are not directly applicable to queries with wild cards. 
Our queries can benefit from inverted indexes on terms, 
phrases, and N-grams and there are already some works in 
these areas (e.g. [51 [TD]). Other related work includes the 
work on Web query languages and wrappers (see 111] for a 
survey of this area before 1998) . These works can be used to 
extract data from a specific site or a set of pages with sim- 
ilar structures but are not generally applicable to free text. 
Finally, Google's fiU-in-the-blank is related to our wild cards 
but is different. Google returns a ranked list of pages for a 
fill-in-the-blank search but the ranking is different (and the 
detail is not published). Our evaluation includes Google's 
fill-in-the-blank (referred to as Google*). 

8. CONCLUSIONS 

We presented a framework for querying and large-scale 
data extraction from natural language text, and evaluated 
the effectiveness of our framework within a few data extrac- 
tion tasks on the Web. We analyzed our rankings of the 
results in terms of the stability and the locality of the scor- 
ing functions and conducted experiments comparing their 
effectiveness in terms of precision and recall. Our querying 
interface is intuitive for writing queries and scalable to large 
number of rewritings and with more wild cards. 

Our work opens a few interesting directions for future 
work. First, it would be interesting to study other wild 
cards, querying schemes and formalisms that are simple for 
writing queries and at the same time have a well-defined syn- 
tax and semantics for query evaluations and optimizations. 
We consider our wild card queries as a first step toward more 
formal querying of natural language text. Second area for 
possible future work is data storage and indexing. Despite 
the extensive work on indexing free text, there is not much 
work on indexing natural language text in particular. Given 
that natural language text can be parsed, there is much room 
for research on building indexes that are aware of sentence 
structures and queries. Third, a more formal query syntax 
and semantics opens the door for more study on mapping 
queries to evaluation plans and access path selection and op- 
timization. These are important issues when querying large 
repositories and/or posing complex queries. Another area 
for further study is on extracting n-ary relations for n > 3; 
the problem in general is difficult since the columns of target 
rows can be scattered in multiple sentences. Yet one more 
area is on mapping natural language questions to more for- 



mal queries that can be efficiently evaluated. 
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APPENDIX 

We give two definitions before proving our theorems. 

Definition 3. The normalized Kendall tau distance be- 
tween two n-dimensional real vectors wi and W2 ts defined 
as 

2 " " 

^ > i=l j = l 

where 

, _ ( 1 if wi{i) < wi{j) AND W2{i) > W2{j) 
J^i.^2(hJ) - I otherwise. 

The Kendall tau distance measures the number of pairwise 
disagreements between two (ranked) lists. The Kendall tau 
distance changes if there is a change in ordering, but any 
change in the actual scores would not affect the Kendall tau 
distance if the ordering remains the same. 

Definition 4. The normalized Manhattan distance be- 
tween two n-dimensional real vectors Wi and W2 is defined 
as 

1 " 

di{wi,W2) = — 7 |™i(j) - W2(i)\. 
n ^ — ' 

i = l 

Theorem\l\ The scoring function NPatterns is stable under 
both dkt and di . 

Proof. Let G' = G\{ei, . . . , efe} and {1,...,/} be the 
set of tuples in G' which are affected by the change. Clearly 
I < k. 

(Case for dkt) Consider the ordering of the tuples under 
NPatterns(G). With every edge ei removed from the graph, 
the ordering can be disturbed by n substitutions at most. 
This extreme case arises when all tuples have the same scores 
and removing a moves a tuple to the bottom of the list. In 
total, there are at most k tuples affected. Therefore the 
maximum for dkt{w\,W2) < 2{kn)/n{n — 1) = 2k/ {n — 1), 
and lim„^oo dkt{wi,W2) — 0. 

( Case for d\ ) The sum of the changes in scores for affected 
nodes cannot exceed k whereas this sum for non-affected 
nodesisO. Thus lim„_cx) rfi (wi, W2) = lim„^oo fe/n = 0. □ 

Theorem [3 NPages is stable under dkt ■ NPages is also 
stable under di if w(p t) bound to a fixed constant. 

Proof. (Case for dkt) The stability proof is similar to 
the one given for NPatterns. With every edge d removed 



from the graph, the ordering can be disturbed by n substi- 
tutions at most. With at most kn substitutions in total, 
hni„^oo dfet(™i, W2) = 0. 

(Case for dkt) Suppose w{p — > t) is bound to a constant c. 
The sum of the changes in scores for affected nodes cannot 
exceed cfc, and the rest follows. □ 



